Estimating the Self-Consistency of LLMs

Nowak, Robert

arXiv.org Artificial Intelligence

Systems often repeat the same prompt to large language models (LLMs) and aggregate responses to improve reliability. Common approaches include self-consistency or simple majority voting (sample multiple outputs and choose the mode), prompt ensembling (rephrasing prompts to reduce wording sensitivity), and multi-agent debate (running multiple instances and aggregating their conclusions). Such consensus methods can stabilize outputs and improve accuracy, especially on multi-step reasoning tasks [1]. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a fixed compute budget B = mn, where m is the number of prompts sampled from the task distribution and n is the number of repeated LLM calls per prompt; the resulting analysis favors a roughly balanced split m ≈ n ≈ √B. It complements recent work on self-consistency prompting that aggregates multiple sampled reasoning paths to stabilize predictions [2, 3].

Consider a prompt x that requires a binary response.
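The estimator described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_response` is a hypothetical stand-in for one stochastic LLM call returning a binary answer, and self-consistency per prompt is estimated as the fraction of agreeing response pairs among n repeated calls, averaged over m prompts, with the budget B = mn split roughly as m ≈ n ≈ √B.

```python
import math
import random

def estimate_self_consistency(sample_response, prompts, budget):
    """Estimate average self-consistency under a fixed budget B = m * n.

    sample_response(prompt) -> 0 or 1  (hypothetical stochastic LLM call)
    prompts: pool of prompts from the task distribution
    budget:  total number of LLM calls B

    Splits the budget roughly as m ≈ n ≈ sqrt(B), the balanced split
    favored by the analysis in the note.
    """
    n = max(2, math.isqrt(budget))          # repeated calls per prompt
    m = max(1, budget // n)                 # number of prompts sampled
    chosen = random.sample(prompts, min(m, len(prompts)))

    per_prompt = []
    for x in chosen:
        ys = [sample_response(x) for _ in range(n)]
        k = sum(ys)                         # count of 1-responses
        # Unbiased estimate of P(two independent samples agree):
        # agreeing pairs / total pairs among the n responses.
        agree_pairs = k * (k - 1) / 2 + (n - k) * (n - k - 1) / 2
        per_prompt.append(agree_pairs / (n * (n - 1) / 2))

    return sum(per_prompt) / len(per_prompt)
```

A perfectly deterministic responder yields an estimate of 1.0, while a responder that answers uniformly at random yields an estimate near 0.5, matching the intuition that agreement between two independent fair-coin samples occurs half the time.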